Probabilistic Arabic Part of Speech Tagger with Unknown Words Handling

Authors

Mohammed Albared

Tareq Al-Moslmi

Nazlia Omar

Adel Al-Shabi

Abstract

A part-of-speech (POS) tagger is an essential preprocessing component in many natural language applications. In this paper, we investigate the best configuration of a trigram Hidden Markov Model (HMM) Arabic POS tagger when only a small tagged corpus is available. With little training data, guessing the POS of unknown words is the main problem. This problem becomes more serious in languages such as Arabic, which have a very large vocabulary and a rich, complex morphology. To handle this problem in an Arabic POS tagger, we study the effect of integrating a lexicon-based morphological analyzer on the tagger's performance. Moreover, several lexical models have been empirically defined, implemented and evaluated. These models are based essentially on the internal structure and the formation process of Arabic words. Furthermore, several combinations of these models are presented. The POS tagger was trained on a corpus of 29,300 words and uses a tagset of 24 POS tags. Our system achieves state-of-the-art overall accuracy in Arabic part-of-speech tagging and outperforms other Arabic taggers in unknown-word POS tagging accuracy.
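To make the general idea concrete, the following is a minimal sketch of a trigram HMM tagger that falls back to a suffix-based guess for unknown words. It is not the authors' system: the abstract does not specify their lexical models, so the suffix heuristic, the fixed interpolation weights, the class name `TrigramHMMTagger`, and the toy English corpus are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

START, STOP = "<s>", "</s>"

class TrigramHMMTagger:
    """Hypothetical toy tagger; weights and suffix length are assumptions."""

    def __init__(self, max_suffix=3, weights=(0.6, 0.3, 0.1)):
        self.max_suffix, self.weights = max_suffix, weights
        self.tri, self.tri_ctx = Counter(), Counter()
        self.bi, self.bi_ctx = Counter(), Counter()
        self.uni, self.total = Counter(), 0
        self.emit = defaultdict(Counter)    # tag -> word counts
        self.suffix = defaultdict(Counter)  # word suffix -> tag counts
        self.vocab, self.tags = set(), []

    def train(self, sentences):
        for sent in sentences:
            tags = [START, START] + [t for _, t in sent] + [STOP]
            for i in range(2, len(tags)):
                t1, t2, t3 = tags[i - 2], tags[i - 1], tags[i]
                self.tri[(t1, t2, t3)] += 1
                self.tri_ctx[(t1, t2)] += 1
                self.bi[(t2, t3)] += 1
                self.bi_ctx[t2] += 1
                self.uni[t3] += 1
                self.total += 1
            for word, tag in sent:
                self.vocab.add(word)
                self.emit[tag][word] += 1
                for k in range(1, min(self.max_suffix, len(word)) + 1):
                    self.suffix[word[-k:]][tag] += 1
        self.tags = sorted(t for t in self.uni if t != STOP)

    def trans(self, t1, t2, t3):
        # Linear interpolation of trigram, bigram and unigram estimates.
        l3, l2, l1 = self.weights
        c12, c2 = self.tri_ctx[(t1, t2)], self.bi_ctx[t2]
        p3 = self.tri[(t1, t2, t3)] / c12 if c12 else 0.0
        p2 = self.bi[(t2, t3)] / c2 if c2 else 0.0
        return l3 * p3 + l2 * p2 + l1 * self.uni[t3] / self.total

    def emission(self, word, tag):
        if word in self.vocab:
            return self.emit[tag][word] / self.uni[tag]
        # Unknown word: use the longest suffix seen in training, then
        # turn P(tag | suffix) into a score proportional to P(word | tag).
        for k in range(min(self.max_suffix, len(word)), 0, -1):
            counts = self.suffix.get(word[-k:])
            if counts:
                p = (counts[tag] + 0.01) / (sum(counts.values()) + 0.01 * len(self.tags))
                return p / (self.uni[tag] / self.total)
        return 1.0  # no evidence: let the transition model decide

    def tag(self, words):
        def lg(p):
            return math.log(p) if p > 0 else float("-inf")
        # Viterbi over tag pairs: (prev, cur) -> (log score, best path).
        chart = {(START, START): (0.0, [])}
        for w in words:
            nxt = {}
            for (t1, t2), (score, path) in chart.items():
                for t3 in self.tags:
                    s = score + lg(self.trans(t1, t2, t3)) + lg(self.emission(w, t3))
                    if s > nxt.get((t2, t3), (float("-inf"), None))[0]:
                        nxt[(t2, t3)] = (s, path + [t3])
            chart = nxt
        best = max(chart.items(),
                   key=lambda kv: kv[1][0] + lg(self.trans(kv[0][0], kv[0][1], STOP)))
        return best[1][1]

# Toy corpus (invented for illustration, not the paper's data).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("walked", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("jumped", "VERB")],
    [("a", "DET"), ("bird", "NOUN"), ("sings", "VERB")],
]
tagger = TrigramHMMTagger()
tagger.train(corpus)
print(tagger.tag(["the", "cat", "talked"]))
```

Here "talked" never appears in the toy corpus, so its emission score comes from the suffix "ked" observed on "walked". A system like the one described would presumably replace this heuristic with the morphological analyzer and lexical models mentioned in the abstract.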

Similar resources

Studying impressive parameters on the performance of Persian probabilistic context free grammar parser

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The exploitation of treebank data has been important ever since the first large-scale treebank, the Penn Treebank, was published. However, although they originated in computational linguistics, the value of treebanks is becoming more widely appreciated in linguistics research as a whole. F...

ACL - 05 Computational Approaches to Semitic Languages

We explore the application of memory-based learning to morphological analysis and part-of-speech tagging of written Arabic, based on data from the Arabic Treebank. Morphological analysis – the construction of all possible analyses of isolated unvoweled wordforms – is performed as a letter-by-letter operation prediction task, where the operation encodes segmentation, part-of-speech, character cha...

TnT -- A Statistical Part-of-Speech Tagger

Trigrams'n'Tags (TnT) is an efficient statistical part-of-speech tagger. Contrary to claims found elsewhere in the literature, we argue that a tagger based on Markov models performs at least as well as other current approaches, including the Maximum Entropy framework. A recent comparison has even shown that TnT performs significantly better for the tested corpora. We describe the basic model of...
